ABSTRACT

This paper presents a novel sparse kernel learning framework
for anomaly detection based on Support Vector Data Description
(SVDD). Optimal sparse feature selection for anomaly detection
is first modeled as a Mixed Integer Programming (MIP) problem.
Because the MIP has prohibitively high computational
complexity, it is relaxed into a Quadratically Constrained
Linear Programming (QCLP) problem. The QCLP problem can then be
solved in practice by an iterative optimization method that
finds multiple subsets of features rather than a single subset.
The QCLP-based iterative optimization problem is solved in a
finite-dimensional space called the Empirical Kernel Feature
Space (EKFS) instead of in the input space or the Reproducing
Kernel Hilbert Space (RKHS). This is possible because the EKFS
and the corresponding RKHS have the same geometrical
properties. An explicit nonlinear exploitation of the data in a
finite-dimensional EKFS therefore becomes achievable, which
results in optimal feature ranking. Experimental results on a
hyperspectral image show that the proposed method can improve
on current state-of-the-art techniques.

Index Terms - Sparse kernel learning, optimal feature
selection, empirical kernel feature space, empirical kernel
map

1. INTRODUCTION 

Feature selection for learning algorithms aims to find a
relevant subset of features that can improve the learning
performance by discarding features that are not useful, or are
even harmful, for the given task. In the case of kernel-based
anomaly detection, such as SVDD, feature selection requires an
accurate estimate of the contribution of each feature to the
objective function, i.e., the radius of a hypersphere in the
RKHS.

In this paper, a new framework of optimal sparse kernel
learning for SVDD-based anomaly detection (OSKLAD) is
proposed. The proposed OSKLAD extends the feature selection
technique used for kernel-based learning approaches [?] to
SVDD-based anomaly detection by fully optimizing the feature
selection method for nonlinear kernels in a newly defined
finite-dimensional space called the EKFS [?]. Hence, the
OSKLAD can be considered a fully optimized version of the
wrapper approach to SVDD-based anomaly detection with
nonlinear kernels. The initial objective of the proposed
OSKLAD is to find a single subset of the original features
that can be used to build an optimal hypersphere in the RKHS.
This objective can be modeled as a Mixed Integer Programming
(MIP) problem. However, the MIP problem is NP-hard, and so the
MIP model is relaxed into a Quadratically Constrained Linear
Programming (QCLP) problem [?] by converting the objective
function of the MIP problem into lower-bounded quadratic
inequality constraints. This QCLP problem is still intractable
because of the prohibitively large number of inequality
constraints. To address this issue, a cutting plane method
based on the restricted master problem, coupled with Multiple
Kernel Learning (MKL) [?], is used iteratively. The goal is to
find only the small subset of the inequality constraints that
are active, i.e., that actually define the feasible region of
the parameters of the given QCLP problem.

The active constraints are identified by finding, at each
iteration, the most violated constraints, i.e., those whose
half-planes are maximally violated by the current solution.
Therefore, the task becomes finding multiple subsets of most
violated features associated with the corresponding most
violated constraints, given the objective function, such as
the radius of a hypersphere in the RKHS. However, finding the
most violated constraints also becomes a combinatorial problem
if nonlinear kernels, such as the Gaussian RBF kernel or
high-order polynomial kernels, are used, because of the
prohibitively large number of possible combinations (subsets)
of the original features. To tackle this issue, in the
proposed OSKLAD, the most violated features are found in the
EKFS. The EKFS is a finite-dimensional space linearly spanned
by basis vectors, which are generated by a map called the
Empirical Kernel Map (EKM). It is shown that the EKFS and the
corresponding RKHS constructed by using the same kernel
function have the same geometrical properties. This means that
an optimization problem that depends on the data only through
dot products has identical solutions in either space. In the
proposed OSKLAD, the subsets of the most violated features are
found optimally in the EKFS, since individual features can be
ranked optimally by their contribution to the radius in the
EKFS, based on the properties of the canonical dot product and
the finite dimensionality of the space.


2. OPTIMAL SPARSE KERNEL LEARNING 

In this section, we present optimal sparse kernel learning
for anomaly detection (OSKLAD) using SVDD as a basic
building block. Inspired by the feature selection approach for
kernel-based classification [?], the OSKLAD addresses
the problem of optimal feature selection for SVDD-based
anomaly detection. As in standard SVDD, the basic formulation
of OSKLAD minimizes the radius of the enclosing hypersphere
while allowing outliers; the difference is that in OSKLAD only
a subset of the features is used. The model is therefore
described as a mixed integer programming problem:

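$$
\begin{aligned}
\min_{d \in \mathcal{D}} \; \min_{R, a, \xi} \quad & R^2 + C \sum_{i=1}^{N} \xi_i \\
\text{subject to} \quad & \|\Phi(\tilde{x}_i) - a\|^2 \le R^2 + \xi_i, \\
& \xi_i \ge 0, \\
& \tilde{x}_i = x_i \odot d, \quad i = 1, 2, \ldots, N,
\end{aligned}
\tag{1}
$$

where $d \in \mathcal{D} = \{ d \mid d_j \in \{0, 1\},\ \sum_{j=1}^{M} d_j = B,\ j = 1, 2, \ldots, M \}$,
and $\odot$ represents the elementwise product. $R$ and $a$
are the radius and the center of the hypersphere, $\xi_i$ are
slack variables, and $C$ controls the trade-off between the
radius and the outliers. Here $B$ is a threshold that controls
the number of features that are selected. If $d$ is assumed to
be fixed in Eq. (1), the problem turns into a continuous
constrained optimization problem, just like a standard SVDD.
By applying Lagrange multipliers and the KKT conditions to it,
we can derive the dual problem (similar to standard SVDD) as: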
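$$
\begin{aligned}
\min_{d \in \mathcal{D}} \; \max_{\alpha} \quad & \sum_{i=1}^{N} \alpha_i k(\tilde{x}_i, \tilde{x}_i) - \sum_{i=1}^{N} \sum_{j=1}^{N} \alpha_i \alpha_j k(\tilde{x}_i, \tilde{x}_j) \\
\text{subject to} \quad & \sum_{i=1}^{N} \alpha_i = 1, \\
& 0 \le \alpha_i \le C, \\
& \tilde{x}_i = x_i \odot d, \quad i = 1, 2, \ldots, N.
\end{aligned}
\tag{2}
$$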
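
For a fixed $d$, the step from the primal to the dual follows the standard SVDD derivation. The sketch below spells it out, assuming multipliers $\alpha_i \ge 0$ for the distance constraints and $\gamma_i \ge 0$ for $\xi_i \ge 0$ (the symbol $\gamma_i$ is introduced here only for illustration):

```latex
\begin{align*}
L &= R^2 + C \sum_{i=1}^{N} \xi_i
     - \sum_{i=1}^{N} \alpha_i \bigl( R^2 + \xi_i - \|\Phi(\tilde{x}_i) - a\|^2 \bigr)
     - \sum_{i=1}^{N} \gamma_i \xi_i, \\
\frac{\partial L}{\partial R} = 0 &\;\Rightarrow\; \sum_{i=1}^{N} \alpha_i = 1, \qquad
\frac{\partial L}{\partial a} = 0 \;\Rightarrow\; a = \sum_{i=1}^{N} \alpha_i \Phi(\tilde{x}_i), \\
\frac{\partial L}{\partial \xi_i} = 0 &\;\Rightarrow\; C - \alpha_i - \gamma_i = 0
     \;\Rightarrow\; 0 \le \alpha_i \le C.
\end{align*}
```

Substituting these conditions back into $L$ leaves $\sum_i \alpha_i k(\tilde{x}_i, \tilde{x}_i) - \sum_{i,j} \alpha_i \alpha_j k(\tilde{x}_i, \tilde{x}_j)$, which is the quantity maximized over $\alpha$ in the dual.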

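However, Eq. (2) is still a mixed integer programming (MIP)
problem because of the last constraint, and it is
computationally expensive to solve. To make it tractable, it
can be converted into a Quadratically Constrained Linear
Programming (QCLP) problem. We define

$$
S(\alpha, d) = \sum_{i=1}^{N} \alpha_i k(\tilde{x}_i, \tilde{x}_i) - \sum_{i=1}^{N} \sum_{j=1}^{N} \alpha_i \alpha_j k(\tilde{x}_i, \tilde{x}_j),
$$

and introduce an additional parameter $t$ to obtain the QCLP
relaxation of Eq. (2) as follows:

$$
\begin{aligned}
\max_{\alpha, t} \quad & t \\
\text{subject to} \quad & \sum_{i=1}^{N} \alpha_i = 1, \\
& 0 \le \alpha_i \le C, \\
& t \le S(\alpha, d), \quad \forall d \in \mathcal{D}.
\end{aligned}
\tag{3}
$$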
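
To make the role of $S(\alpha, d)$ concrete, the following minimal sketch evaluates it for one feasible $\alpha$ and one binary mask $d$. The function names, the toy data, and the bandwidth value are assumptions made for illustration and are not taken from the paper.

```python
import numpy as np

def rbf_kernel(A, B, sigma=1.0):
    """Gaussian RBF kernel matrix between the rows of A and the rows of B."""
    sq = (A**2).sum(1)[:, None] + (B**2).sum(1)[None, :] - 2.0 * A @ B.T
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * sigma**2))

def dual_objective(alpha, X, d, kernel=rbf_kernel):
    """S(alpha, d) evaluated on the masked data x_i * d (elementwise)."""
    Xd = X * d                      # keep only the features with d_j = 1
    K = kernel(Xd, Xd)
    return float(alpha @ np.diag(K) - alpha @ K @ alpha)

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 4))         # N = 6 samples, M = 4 features (toy data)
alpha = np.full(6, 1.0 / 6)         # feasible: sums to 1, 0 <= alpha_i <= C
d = np.array([1, 0, 1, 0])          # B = 2 selected features
S_value = dual_objective(alpha, X, d)
```

Each candidate $d$ contributes one constraint $t \le S(\alpha, d)$, so the number of constraints grows with the number of size-$B$ subsets of the $M$ features.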

Though Eq. (3) is convex, the large number of inequality
constraints (the last condition in Eq. (3)) makes it
impractical to solve with existing techniques. The number
becomes huge if the features reside in a high-dimensional
space. Note that not all the inequality constraints in Eq. (3)
are active in defining the feasible region of the optimization
problem. In fact, only a small number of the constraints are
useful and directly used to solve the optimization problem.
Therefore, an iterative algorithm can be used in which,
instead of solving Eq. (3) at once, an intermediate solution
pair $(t, \alpha)$ is iteratively updated based on a limited
subset of previously found active constraints. This
optimization problem is called the restricted master problem,
which is closely related to the cutting plane algorithm
described in [?]. The restricted master problem consists of
two steps [?]: 1) $(t, \alpha)$ are optimized based on a
previously found restricted subset $\mathcal{I}$ of feature
vectors, which maximally violate the constraints; and 2) a new
vector $d$ of the most violated features is obtained based on
the newly optimized $(t, \alpha)$ from step 1 and added to the
restricted subset, $\mathcal{I} = \mathcal{I} \cup \{d\}$.
These two steps are iterated until convergence [7]. Finding
the vector $d$ of the most violated features is detailed in
the next section.

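The intermediate solution pair $(t, \alpha)$ is now obtained
from the following optimization problem:

$$
\begin{aligned}
\max_{\alpha, t} \quad & t \\
\text{subject to} \quad & \sum_{i=1}^{N} \alpha_i = 1, \\
& 0 \le \alpha_i \le C, \\
& t \le S(\alpha, d^l), \quad \forall d^l \in \mathcal{I}.
\end{aligned}
\tag{4}
$$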

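Let $\mu_l \ge 0$ be the dual variable for the $l$-th
inequality constraint $t \le S(\alpha, d^l)$ in Eq. (4), with
$l = 1, 2, \ldots, p$. The Lagrangian of Eq. (4) can be
written as:

$$
L(t, \mu) = t + \sum_{l=1}^{p} \mu_l \bigl( S(\alpha, d^l) - t \bigr). \tag{5}
$$

By setting $\partial L / \partial t = 0$, we have
$\sum_{l=1}^{p} \mu_l = 1$. The Lagrangian $L(t, \mu)$, after
applying this partial KKT condition, can be rewritten as
$L(t, \mu) = \sum_{l=1}^{p} \mu_l S(\alpha, d^l)$, which
transforms Eq. (4) to the following problem:

$$
\begin{aligned}
\max_{\alpha} \; \min_{\mu} \quad & \sum_{l=1}^{p} \mu_l S(\alpha, d^l) \\
\text{subject to} \quad & \sum_{i=1}^{N} \alpha_i = 1, \\
& 0 \le \alpha_i \le C \ \text{for} \ i = 1, 2, \ldots, N, \\
& \sum_{l=1}^{p} \mu_l = 1, \quad \mu_l \ge 0 \ \text{for} \ l = 1, 2, \ldots, p.
\end{aligned}
\tag{6}
$$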
One can observe that Eq. (6) can be solved using a two-step
iterative process to obtain the optimal sparse weights $\mu^*$
of the individual kernels and the optimal Lagrange multipliers
$\alpha^*$ (which define the support vectors of the enclosing
hypersphere).
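
As an illustration of one of the two steps, the sketch below updates $\alpha$ for fixed kernel weights $\mu$ by maximizing $\sum_l \mu_l S(\alpha, d^l)$ over the constraint set $\sum_i \alpha_i = 1$, $0 \le \alpha_i \le C$. It uses a Frank-Wolfe iteration with exact line search, which is an assumption chosen here for simplicity and is not the solver reported in the paper. All function names are hypothetical, and $C \ge 1/N$ is assumed so that the constraint set is non-empty.

```python
import numpy as np

def capped_simplex_vertex(g, C):
    """Maximize g @ s subject to sum(s) = 1 and 0 <= s_i <= C (greedy fill)."""
    s = np.zeros_like(g)
    remaining = 1.0
    for i in np.argsort(-g):        # largest gradient entries first
        s[i] = min(C, remaining)
        remaining -= s[i]
        if remaining <= 0.0:
            break
    return s

def solve_alpha(kernels, mu, C, n_iter=500):
    """Frank-Wolfe ascent on sum_l mu_l * S(alpha, d^l) for fixed mu."""
    K = sum(m * Kl for m, Kl in zip(mu, kernels))   # combined kernel matrix
    n = K.shape[0]
    alpha = np.full(n, 1.0 / n)                     # feasible when C >= 1/n
    for _ in range(n_iter):
        grad = np.diag(K) - 2.0 * K @ alpha
        p = capped_simplex_vertex(grad, C) - alpha  # ascent direction
        gain = grad @ p
        if gain <= 1e-12:                           # Frank-Wolfe gap closed
            break
        curv = p @ K @ p
        # Exact line search for the concave quadratic, clipped to [0, 1]
        step = 1.0 if curv <= 0.0 else min(1.0, gain / (2.0 * curv))
        alpha = alpha + step * p
    return alpha

rng = np.random.default_rng(1)
X = rng.normal(size=(8, 5))
masks = [np.array([1, 1, 0, 0, 0]), np.array([0, 0, 1, 1, 0])]
kernels = [(X * d) @ (X * d).T for d in masks]      # one linear kernel per subset
alpha = solve_alpha(kernels, mu=[0.5, 0.5], C=0.5)
```

With $\mu$ fixed, the weighted sum of kernels acts as a single kernel, so this step has the same form as a standard SVDD dual; the complementary step would update $\mu$ on the simplex for the new $\alpha$.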

3. OPTIMAL FEATURE SELECTION: FINDING 
MAXIMALLY VIOLATING FEATURES 

For updating $d$, the features that maximally violate the last
constraint in Eq. (3) need to be determined. Since the goal of
Eq. (3) is to maximize $t$, and $t$ is upper-bounded by
$S(\alpha, d)$ according to the constraint, the features that
maximally violate this constraint are those that minimize
$S(\alpha, d)$. One has to solve the following optimization
problem:

$$
\begin{aligned}
\min_{d} \quad & S(\alpha, d) \\
\text{subject to} \quad & \sum_{i=1}^{M} d_i = B, \\
& d_i \in \{0, 1\}.
\end{aligned}
\tag{7}
$$

In this section, we describe the method to find these feature
vectors for both linear and non-linear kernels.

3.1. Linear Kernel 

If a linear kernel is used, since
$k(\tilde{x}_i, \tilde{x}_j) = \langle \tilde{x}_i, \tilde{x}_j \rangle$
and $d_j^2 = d_j$ for binary $d_j$, we have
$S(\alpha, d) = \sum_{j=1}^{M} d_j c_j$, where
$c_j = \sum_{i=1}^{N} \alpha_i x_{ij}^2 - \bigl( \sum_{i=1}^{N} \alpha_i x_{ij} \bigr)^2$.
$S(\alpha, d)$ is therefore a linear function of $d$. Once we
have the optimal support vectors, the global solution for $d$
can be obtained easily by sorting the $c_j$'s in ascending
order and setting the first $B$ corresponding elements $d_j$
of $d$ to 1 and the rest to 0. Once the optimal feature subset
is chosen for a kernel, the optimal $\alpha$ and $\mu$ are
updated by solving Eq. (6). These two steps are repeated until
the algorithm converges.
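
The sorting step can be sketched as follows, with a brute-force search over all subsets of size $B$ included only as a sanity check on toy data. The function names and the random data are illustrative assumptions.

```python
import numpy as np
from itertools import combinations

def feature_scores(alpha, X):
    """c_j = sum_i alpha_i x_ij^2 - (sum_i alpha_i x_ij)^2 for every feature j."""
    return alpha @ X**2 - (alpha @ X) ** 2

def most_violated_subset(alpha, X, B):
    """Binary d with B ones on the features with the smallest scores c_j."""
    c = feature_scores(alpha, X)
    d = np.zeros(X.shape[1], dtype=int)
    d[np.argsort(c)[:B]] = 1
    return d

rng = np.random.default_rng(2)
X = rng.normal(size=(10, 6)) * np.array([0.1, 1.0, 3.0, 0.5, 2.0, 0.2])
alpha = np.full(10, 0.1)            # feasible multipliers (sum to 1)
d = most_violated_subset(alpha, X, B=3)

# Brute-force check over all subsets of size B (feasible only for tiny M)
c = feature_scores(alpha, X)
best = min(combinations(range(6), 3), key=lambda idx: c[list(idx)].sum())
```

Because $S(\alpha, d)$ is a sum of per-feature terms, the sorted selection costs $O(M \log M)$ instead of enumerating every subset.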

3.2. Non-linear Kernel 

If a Gaussian RBF kernel is used, $S(\alpha, d)$ is not a
linear function of $d$, and the problem in Eq. (7) cannot be
solved optimally because of the large number of feature
combinations that would have to be considered. The data are
therefore transformed from the infinite-dimensional RKHS into
another space of finite dimensionality, called the empirical
kernel feature space (EKFS), by using the empirical kernel map
(EKM). This allows subsets of features to be selected
optimally while still preserving the nonlinear correlations
among the features. For a given set of training data points
$\{x_1, \ldots, x_n\}$, the map defined by

$$
\Phi_n : \mathbb{R}^M \to \mathbb{R}^n, \quad x \mapsto k(\cdot, x)\big|_{\{x_1, \ldots, x_n\}} = \bigl( k(x_1, x), \ldots, k(x_n, x) \bigr)^T \tag{8}
$$

is called the EKM with respect to $\{x_1, \ldots, x_n\}$ [?].
However, the kernel function $k$ used to build the kernel
matrices in the previous subsections cannot be represented
using $\Phi_n$ directly, since the basis vectors do not form
an orthonormal system. The dot product to use in the
representation of $k$ is not the canonical dot product in the
EKFS $\mathbb{R}^n$. In order to turn $\Phi_n$ into a feature
map associated with $k$, the EKFS is endowed with a dot
product $\langle \cdot, \cdot \rangle_n$ such that
$k(x_i, x_j) = \langle \Phi_n(x_i), \Phi_n(x_j) \rangle_n$.
After analyzing certain conditions using this equality, as
shown in [?], the dot product $\langle \cdot, \cdot \rangle_n$
can be converted to a canonical dot product by merely
whitening the EKFS and using the new basis functions as
features. It can be represented as

$$
k(x_i, x_j) = \langle \Phi_n^w(x_i), \Phi_n^w(x_j) \rangle, \tag{9}
$$

where the feature map in the whitened EKFS is given by

$$
\Phi_n^w : x \mapsto K^{-1/2} \bigl( k(x_1, x), \ldots, k(x_n, x) \bigr)^T, \tag{10}
$$

where $K$ is the Gram matrix with $K_{ij} = k(x_i, x_j)$. The
kernel function in Eq. (9) is used to build the kernel
matrices in Eqs. (2)-(7). Hence, the feature subset selection
problem turns exactly into Eq. (7) (the linear version),
except that in this case the features are selected in the
EKFS. Similar to the OSKLAD with a linear kernel, the overall
Optimal Sparse Kernel Learning for Anomaly Detection (OSKLAD)
in the EKFS is described in Algorithm 1.
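
The whitened map can be sketched with an eigendecomposition of the Gram matrix. The code below is a minimal illustration under stated assumptions: the function names are hypothetical, the features are expressed in the eigenbasis of $K$ (a rotation of $K^{-1/2} k(\cdot, x)$ that leaves dot products unchanged), and eigenvalues below a relative tolerance are dropped, which is a numerical choice made here rather than a step stated in the paper.

```python
import numpy as np

def rbf_kernel(A, B, sigma=1.0):
    """Gaussian RBF kernel matrix between the rows of A and the rows of B."""
    sq = (A**2).sum(1)[:, None] + (B**2).sum(1)[None, :] - 2.0 * A @ B.T
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * sigma**2))

def fit_whitened_ekm(X_train, kernel=rbf_kernel, tol=1e-10):
    """Return a function mapping points into the whitened EKFS."""
    K = kernel(X_train, X_train)
    evals, evecs = np.linalg.eigh(K)            # K is symmetric PSD
    keep = evals > tol * evals.max()            # drop numerically null directions
    W = evecs[:, keep] / np.sqrt(evals[keep])   # columns: v_r / sqrt(lambda_r)

    def transform(Z):
        # Rows are the coordinates of K^{-1/2} k(., z) in the retained eigenbasis
        return kernel(Z, X_train) @ W
    return transform

rng = np.random.default_rng(3)
X_train = rng.normal(size=(12, 4))
transform = fit_whitened_ekm(X_train)
F = transform(X_train)              # features in the whitened EKFS
K = rbf_kernel(X_train, X_train)
```

On the training points, the canonical dot products `F @ F.T` reproduce the Gram matrix `K` up to numerical error, so a linear kernel on `F` plays the role of the original nonlinear kernel.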


Algorithm 1 OSKLAD with nonlinear kernel

1: Map the data points into the EKFS by using a certain
kernel $k$.

2: Initialize $\alpha = \frac{1}{N}\mathbf{1}$, find the maximally
violated feature subset $d$, and set $\mathcal{I} = \{d\}$.

3: Run SKAD based on the kernel matrices generated by
$\mathcal{I}$ and optimize for $\alpha$ and $\mu$.

4: Find the next maximally violated feature subset $d$ based
on the current $\alpha$ and $\mu$, and set
$\mathcal{I} = \mathcal{I} \cup \{d\}$.

5: Repeat steps 3-4 until convergence.
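
The control flow of Algorithm 1 can be sketched as the loop below. This is a skeleton only: `solve_master` is a hypothetical user-supplied routine that returns $(\alpha, \mu)$ for the current list of kernel matrices, the input `F` is assumed to already hold the EKFS features (step 1), and stopping when the selected subset is already in $\mathcal{I}$ is an assumed convergence test, since the text does not specify one.

```python
import numpy as np

def most_violated_subset(alpha, F, B):
    """Binary d selecting the B features with the smallest score c_j."""
    c = alpha @ F**2 - (alpha @ F) ** 2
    d = np.zeros(F.shape[1], dtype=int)
    d[np.argsort(c)[:B]] = 1
    return d

def osklad_loop(F, B, solve_master, max_iter=20):
    """Cutting-plane loop over feature subsets; F holds features row-wise."""
    n = F.shape[0]
    alpha = np.full(n, 1.0 / n)                       # step 2: uniform start
    subsets = [most_violated_subset(alpha, F, B)]
    mu = np.ones(1)
    for _ in range(max_iter):
        kernels = [(F * d) @ (F * d).T for d in subsets]
        alpha, mu = solve_master(kernels)             # step 3: restricted master
        d_new = most_violated_subset(alpha, F, B)     # step 4: next subset
        if any(np.array_equal(d_new, d) for d in subsets):
            break                                     # subset already in the set
        subsets.append(d_new)
    return alpha, mu, subsets

def uniform_master(kernels):
    """Placeholder solver (NOT an optimizer): uniform alpha and uniform mu."""
    n = kernels[0].shape[0]
    return np.full(n, 1.0 / n), np.full(len(kernels), 1.0 / len(kernels))

rng = np.random.default_rng(4)
F = rng.normal(size=(9, 5))
alpha, mu, subsets = osklad_loop(F, B=2, solve_master=uniform_master)
```

The placeholder `uniform_master` only exercises the loop; a real run would pass a solver for the restricted master problem in its place.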


4. SIMULATION RESULTS 

In this section, the performance of OSKLAD is evaluated on a
Hyperspectral Digital Imagery Collection Experiment (HYDICE)
image, which contains 30 small painted panels located in the
background. We chose a small patch (69 pixels × 10 pixels) as
the background data set, which is used to obtain the radius
$R$ and the center of the hypersphere. The distance of each
test pixel in the image to the center of the hypersphere is
then determined. If the distance is greater than $R$, the
pixel is considered an anomaly; otherwise, it is a background
pixel. In our experiments, the performance of SVDD, SKAD [?]
and OSKLAD with both linear and Gaussian RBF kernels is
compared. For SVDD and SKAD, both linear and Gaussian RBF
kernels are used in the input space. For OSKLAD with a linear
kernel, feature selection is performed in the input space.
However, for OSKLAD with a Gaussian RBF kernel, the input
vector is first mapped into the EKFS using the EKM. At this
point, we can simply use a linear kernel in the EKFS, which
translates to using a Gaussian RBF kernel in the input space,
as described in the previous sections. The kernel bandwidth
parameter is determined by applying the minimax technique to
10 randomly selected regions of the image that represent the
background, as done in [?]. The same value is used over all
the test pixels in the image for all the algorithms.
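
The decision rule can be sketched with the kernel expansion of the distance to the center, $\|\Phi(z) - a\|^2 = k(z, z) - 2 \sum_i \alpha_i k(x_i, z) + \sum_{i,j} \alpha_i \alpha_j k(x_i, x_j)$. The toy data, the function names, and the use of a linear kernel (as one would in the EKFS) are illustrative assumptions. Here $R^2$ is taken as the squared distance from the center to a support vector on the boundary, as in standard SVDD.

```python
import numpy as np

def squared_distance_to_center(Z, X_train, alpha, kernel):
    """||Phi(z) - a||^2 with a = sum_i alpha_i Phi(x_i), for every row z of Z."""
    K_zz = np.array([kernel(z[None, :], z[None, :])[0, 0] for z in Z])
    K_zx = kernel(Z, X_train)
    K_xx = kernel(X_train, X_train)
    return K_zz - 2.0 * K_zx @ alpha + alpha @ K_xx @ alpha

def linear_kernel(A, B):
    return A @ B.T

X_train = np.array([[0.0, 0.0], [2.0, 0.0]])    # toy background samples
alpha = np.array([0.5, 0.5])                    # center a = (1, 0)
Z = np.array([[1.0, 0.0], [1.0, 3.0]])          # two test pixels
dist2 = squared_distance_to_center(Z, X_train, alpha, linear_kernel)

# Squared radius from a boundary support vector (here the first sample)
R2 = squared_distance_to_center(X_train[:1], X_train, alpha, linear_kernel)[0]
is_anomaly = dist2 > R2                         # [False, True]
```

The first test pixel coincides with the center and is labeled background, while the second lies at squared distance 9 and is flagged as an anomaly.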

The number of features used for each hypersphere of SKAD with
both linear and Gaussian RBF kernels and of OSKLAD with the
linear kernel is 75, which is half of the total number of
features. For OSKLAD in the EKFS, the total number of features
available after mapping the pixels from the input space to the
EKFS is reduced to 96, and 48 features are used for each
hypersphere. Fig. 1 shows the anomaly detection results for
SVDD, SKAD and OSKLAD with both linear and Gaussian RBF
kernels. The value of each pixel in the results is the ratio
of the distance between the pixel and the center of the
hypersphere to the radius of the hypersphere. For comparison,
the values in all the resulting images are normalized to lie
between 0 and 1. All six methods are able to identify the
first two rows of anomalies, but OSKLAD in the EKFS identifies
anomalies with much less noise (a cleaner background) and is
also able to detect the small targets in the third row.
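
A minimal sketch of how such a score map could be produced is given below. The text states only that the values are normalized to lie between 0 and 1, so the min-max rescaling used here, the function name, and the toy distances are assumptions.

```python
import numpy as np

def normalized_score_map(dist, R):
    """Distance-to-radius ratio, rescaled to [0, 1] over the whole image."""
    ratio = dist / R
    lo, hi = ratio.min(), ratio.max()
    if hi == lo:                       # flat map: avoid division by zero
        return np.zeros_like(ratio)
    return (ratio - lo) / (hi - lo)

dist = np.array([[0.5, 1.0], [2.0, 4.5]])   # toy 2 x 2 "image" of distances
score = normalized_score_map(dist, R=1.0)
```

Rescaling every result image to the same range makes the maps of the different detectors visually comparable.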





[Fig. 1 panels: (b) SVDD - RBF kernel; (d) SKAD - RBF kernel; (f) OSKLAD - EKFS]

Fig. 1. Anomaly detection results of the HYDICE image using
SVDD, SKAD and OSKLAD

5. CONCLUSIONS

In the proposed work, to achieve optimality in kernel-based
feature selection for anomaly detection using SVDD, the
QCLP problem is solved optimally in a new finite-dimensional
space called the Empirical Kernel Feature Space (EKFS) instead
of the RKHS. Experimental results show that, by optimally
selecting features in the EKFS rather than in the original
input space, significant improvements can be made in
hyperspectral anomaly detection.